Red wine exploration by Thuy Quach

Abstract

Why do some red wines taste better than others? Is it only because wine tasters say so, or is there another way to tell? Can we tell what makes a great or a bad wine from its chemical properties? And if so, under what conditions is the quality of red wine best?

This is what we are going to explore: the relationship between chemical properties and wine quality.

The analysis includes: data structure, statistical summary, distribution plots, box plots of each variable vs. quality, correlation matrix and scatter plots, final plots exploring the strongly correlated variables, and reflections.

Dataset

The data set used in this analysis can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Summary of the data

First, let’s see how many samples the wine data contains:

## [1] 1599

samples.

Next, let’s list all the variables.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

X is the data entry number and quality is the output variable, which leaves 11 input variables. The data is in wide format.

What about the structure of the data?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Quality was stored as an integer. All other chemical variables were numeric.

The statistical summary of the data is shown below.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Quality ranged from 3 to 8. Residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide had very wide ranges, with maxima far above their third quartiles. Do these variables influence wine quality?

Mean of all variables grouped by quality

The mean is an important value for investigating data. Let’s write a function to calculate the mean of all variables grouped by a variable of interest.

Here are the mean values of all variables grouped by quality.

## Source: local data frame [6 x 12]
## 
##   quality fixed.acidity volatile.acidity citric.acid residual.sugar
##     (int)         (dbl)            (dbl)       (dbl)          (dbl)
## 1       3      8.360000        0.8845000   0.1710000       2.635000
## 2       4      7.779245        0.6939623   0.1741509       2.694340
## 3       5      8.167254        0.5770411   0.2436858       2.528855
## 4       6      8.347179        0.4974843   0.2738245       2.477194
## 5       7      8.872362        0.4039196   0.3751759       2.720603
## 6       8      8.566667        0.4233333   0.3911111       2.577778
## Variables not shown: chlorides (dbl), free.sulfur.dioxide (dbl),
##   total.sulfur.dioxide (dbl), density (dbl), pH (dbl), sulphates (dbl),
##   alcohol (dbl)
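
A grouped-mean helper of the kind described above could look like the following minimal sketch. It is not the function used in the analysis: the name `mean_by_group` is hypothetical, base R `aggregate()` is swapped in for whatever the original used, and `wine_head` holds only the first ten rows shown in the `str()` output above.

```r
# First ten rows of a few columns, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: mean of every other column within each level of group_var.
mean_by_group <- function(df, group_var) {
  value_vars <- setdiff(names(df), group_var)
  aggregate(df[value_vars], by = df[group_var], FUN = mean)
}

means <- mean_by_group(wine_head, "quality")
print(means)
```

With the full data frame in place of `wine_head`, the same call would give one row per quality level, as in the table above.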

Univariate Analysis

Distribution of individual variables by histogram and density:

First, let us explore the distribution of each variable using ggplot.

The data is in wide format, which makes it difficult to draw plots of multiple variables in R. Therefore, I reshaped the data into long format.
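
One way to do this kind of wide-to-long reshape is sketched below with base R `stack()`. This is an assumption about the approach, not the code used in the analysis, and `wine_head` again holds only the first ten rows from the `str()` output.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  pH        = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# stack() turns each column into (value, column name) pairs: one row per measurement.
wine_long <- stack(wine_head)
names(wine_long) <- c("value", "variable")

print(head(wine_long))
```

A long table like this can be handed to ggplot2 with one facet per level of `variable`, which is what makes the multi-variable histogram practical.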

All variables

Some of the variables seemed to follow a roughly normal distribution, such as density, pH, alcohol, volatile.acidity and fixed.acidity. Others were right skewed, such as residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and chlorides.

Quality

The histogram showed that wine quality ranged from 3 to 8. There were no wines with quality 1, 2, 9 or 10. Most of the wine samples had a quality of 5 (681 samples) or 6 (638 samples).

##          n
## 1 82.48906

82.49% of the wines had a quality of 5 or 6.
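
The percentage follows directly from the counts quoted above. A short check of the arithmetic (variable names are illustrative):

```r
n_total <- 1599  # total samples
n_q5    <- 681   # samples with quality 5
n_q6    <- 638   # samples with quality 6

pct_5_or_6 <- 100 * (n_q5 + n_q6) / n_total
print(round(pct_5_or_6, 2))
```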

Data correlations

Let us use ggpairs to build the correlation matrix and see which chemical properties have strong relationships with wine quality and with each other. It was difficult to plot ggpairs on all variables at once because the space allotted to the plot could not hold a 12 x 12 grid of panels, so I created three groups of variables and made sure that the variable “quality” (col 13) was present in each.
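
The grouping idea can be sketched as follows. The exact split used in the analysis is not shown, so the three groups below are hypothetical; only the column order and the position of quality (column 13) come from the variable listing above.

```r
var_names <- c("X", "fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality")
quality_col <- 13

# Hypothetical split of the 11 chemical columns (2 to 12); quality joins every group.
col_groups <- list(
  c(2:5,   quality_col),
  c(6:9,   quality_col),
  c(10:12, quality_col)
)

for (g in col_groups) print(var_names[g])

# With the full data frame loaded as `wine` (name assumed) and GGally installed,
# each group could then be plotted with: GGally::ggpairs(wine[, g])
```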

We learned that any correlation above 0.3 in absolute value is meaningful and one above 0.7 is pretty strong. Let us see whether we can find any in the results below.

The correlation coefficient between quality and volatile.acidity was -0.391, between citric.acid and fixed.acidity 0.672, and between citric.acid and volatile.acidity -0.552.

The correlation coefficient between total.sulfur.dioxide and free.sulfur.dioxide was 0.668.

The correlation coefficient between quality and alcohol was 0.476, and between pH and density -0.342.

Bivariate analysis plots

Which chemical properties correlated with each other?

From the previous correlation analysis, we found that some chemical properties were strongly correlated with each other. Let’s explore them with scatter plots and a linear regression line (blue line).

Since we will apply the same kind of visualization to all the chemical properties, let’s write a function to plot it.

Bivariate plot function:
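
The original function is not shown here, so what follows is only a sketch of how such a helper might be written with ggplot2. The name `scatter_with_lm` and its arguments are hypothetical, and `wine_head` holds the first ten rows from the `str()` output.

```r
library(ggplot2)

# First ten rows of two columns, copied from the str() output above.
wine_head <- data.frame(
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Hypothetical helper: scatter plot of two named columns with a blue
# least-squares regression line and no confidence band.
scatter_with_lm <- function(df, x_var, y_var) {
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]])) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
    labs(x = x_var, y = y_var)
}

p <- scatter_with_lm(wine_head, "citric.acid", "fixed.acidity")
# print(p) draws the plot on the active graphics device.
```

Passing column names as strings lets the same helper be reused for each pair of chemical properties discussed below.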

Citric.acid and fixed.acidity

Here is the relationship between citric.acid and fixed.acidity, shown as a scatter plot with a linear regression line.

Statistical summary of citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Statistical summary of fixed.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

As shown in the plot, citric.acid and fixed.acidity increased together (see the blue linear regression line). Citric.acid ranged from 0 to 1 g/dm^3 while fixed.acidity ranged from 4.6 to 15.9 g/dm^3. This is plausible, since citric acid is one of the acids that contribute to the fixed acidity of wine. The earlier correlation analysis supports this result: the correlation coefficient of the two properties was 0.672.

Total.sulfur.dioxide and free.sulfur.dioxide

Here is the relationship between total.sulfur.dioxide and free.sulfur.dioxide, shown as a scatter plot with a linear regression line.

Statistical summary of total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Statistical summary of free.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

We could see from the plot that as total.sulfur.dioxide increased, free.sulfur.dioxide increased as well. Total.sulfur.dioxide ranged from 6 to 289 mg/dm^3 while free.sulfur.dioxide ranged from 1 to 72 mg/dm^3. This is expected, since free sulfur dioxide is a part of total sulfur dioxide. The earlier correlation analysis supports this result: the correlation coefficient of the two properties was 0.668.

pH and density:

Here is the relationship between pH and density, shown as a scatter plot with a linear regression line.

Statistical summary of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Statistical summary of density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Though the correlation was not strong, we could notice that as pH increased, density decreased. pH ranged from 2.74 to 4.01 while density ranged from 0.990 to 1.004. The range of density was very small (around 0.014). The earlier correlation analysis supports this result: the correlation coefficient of the two properties was -0.342.

What chemical properties influence wine quality?

From the above correlation analysis, I found that only alcohol and volatile.acidity had a correlation coefficient with quality larger than 0.3 in absolute value. Since we are interested in what makes the best wine, it is important to also consider other chemical properties which may have some impact.

Let’s see the results below.

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632
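
A one-column table like the one above is what base R `cor()` returns when a set of columns is correlated against a single vector. The sketch below shows the call on the first ten rows from the `str()` output; with so few rows the coefficients will not match the full-data values, and whether the analysis used exactly this call is an assumption.

```r
# First ten rows of a few columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

chem_vars <- c("fixed.acidity", "volatile.acidity", "alcohol")

# Pearson correlation of each chemical column with quality (a 3 x 1 matrix).
r_quality <- cor(wine_head[, chem_vars], wine_head$quality)
print(r_quality)

# Sort by absolute strength to see the strongest relationships first.
print(r_quality[order(abs(r_quality[, 1]), decreasing = TRUE), , drop = FALSE])
```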

We could see that 6 chemical properties (volatile.acidity, total.sulfur.dioxide, pH, free.sulfur.dioxide, density, chlorides) had a negative correlation with quality, which suggests that higher levels of these properties go with worse-tasting wine. Among them, volatile.acidity had the strongest association, with a correlation of -0.391. Sulphates, residual.sugar, fixed.acidity, citric.acid and alcohol had a positive correlation with quality. Among them, sulphates, citric.acid and alcohol had the strongest associations, with correlations of 0.251, 0.226 and 0.476 respectively.

Correlation of chemical properties vs. wine quality by boxplots:

From the box plots, it looked like alcohol, sulphates, volatile.acidity and citric.acid might have an impact on the quality of wines. The results were consistent with the previous correlation analysis.

Let’s first compare the distributions of the chemical properties, coded by quality. Here is the function to make a density plot.

Density plot function

Alcohol and quality

Let’s compare the distributions of alcohol for different wine qualities.

The distributions of alcohol were similar and almost normal for all wine qualities except 5, where the distribution was much narrower.

Let’s compute the mean of alcohol for each quality and compare them on a boxplot. First, let’s write a function to draw a boxplot with a mean line.

Boxplot with mean line data
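
The BoxplotWithMean function itself is not reproduced in this report. As a rough sketch of the idea only, a ggplot2 helper could overlay the group means on a boxplot as below; the name `boxplot_with_mean_sketch`, its arguments and the styling are all assumptions, and `wine_head` holds the first ten rows from the `str()` output.

```r
library(ggplot2)

# First ten rows of two columns, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: boxplot of y_var per level of x_var, a line joining
# the group means, and the mean values printed as text labels.
boxplot_with_mean_sketch <- function(df, x_var, y_var, line_colour = "green") {
  ggplot(df, aes(x = factor(.data[[x_var]]), y = .data[[y_var]])) +
    geom_boxplot() +
    stat_summary(aes(group = 1), fun = mean, geom = "line",
                 colour = line_colour) +
    stat_summary(aes(label = round(after_stat(y), 2)), fun = mean,
                 geom = "text", vjust = -0.7) +
    labs(x = x_var, y = y_var)
}

p <- boxplot_with_mean_sketch(wine_head, "quality", "alcohol")
# print(p) draws the plot on the active graphics device.
```

The `group = 1` aesthetic is what lets the mean line run across the discrete quality levels instead of being drawn separately inside each box.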

The boxplot of alcohol vs. quality with a green mean line is shown below. The text labels give the mean alcohol for each quality.

Average alcohol increased with wine quality from 3 to 8, except at quality 5. In particular, average alcohol increased from 9.955 to 12.094 (1.2 times) as wine quality increased from 3 to 8, with a dip to 9.9 at quality 5.

Citric.acid and quality

Let’s compare the distributions of citric.acid for different wine qualities.

We could see the mean of citric.acid shift to the right as wine quality increased.

Let’s compute the mean of citric.acid for each wine quality and compare them using the BoxplotWithMean function.

The plot showed citric.acid vs. quality with a green mean line for citric.acid. The text labels give the mean citric.acid for each quality. As wine quality increased from 3 to 8, the average citric.acid clearly increased, from 0.171 to 0.391 (2.3 times).

Sulphates and quality

Let’s compare the distributions of sulphates for different wine qualities.

We could see the distributions of sulphates were similar and the mean of sulphates shifted to the right as wine quality increased.

Let’s summarize the mean of sulphates and combine the data on a boxplot.

The plot showed sulphates vs. quality with a green mean line for sulphates. The text labels give the mean sulphates for each quality. As wine quality increased from 3 to 8, the average sulphates clearly increased, from 0.570 to 0.768 (1.3 times).

Volatile.acidity and quality

Let’s compare the distributions of volatile.acidity for different wine qualities.

We could see the distributions of volatile.acidity were similar and the mean of volatile.acidity shifted to the left as wine quality increased.

Let’s summarize and arrange the mean of volatile.acidity, and combine the data on a boxplot.

The plot showed volatile.acidity vs. quality with a red mean line for volatile.acidity. The text labels give the mean volatile.acidity for each quality. As wine quality increased, average volatile.acidity decreased: it fell from 0.884 at quality 3 to 0.423 at quality 8 (2.1 times lower), with a minimum of 0.404 at quality 7.

Summary of bivariate analysis:

There were moderate to strong correlations among the chemical properties: citric.acid with fixed.acidity (0.672), citric.acid with volatile.acidity (-0.552), total.sulfur.dioxide with free.sulfur.dioxide (0.668), and pH with density (-0.342).

Some chemical properties were also correlated with quality: volatile.acidity (-0.391), alcohol (0.476), sulphates (0.251) and citric.acid (0.226). Only the first two exceed the 0.3 threshold in absolute value.
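
The fold changes quoted in the bivariate sections can be recomputed from the group means for quality 3 and quality 8 reported in the tables of this report. The sketch below only redoes that arithmetic; the vector names are illustrative.

```r
# Group means at quality 3 and quality 8, copied from the tables in this report.
mean_q3 <- c(alcohol = 9.955, citric.acid = 0.171,
             sulphates = 0.57, volatile.acidity = 0.8845)
mean_q8 <- c(alcohol = 12.094444, citric.acid = 0.3911111,
             sulphates = 0.7677778, volatile.acidity = 0.4233333)

# Ratio of the quality-8 mean to the quality-3 mean for each property.
fold_change <- mean_q8 / mean_q3
print(round(fold_change, 2))

# For a property that decreases, the reciprocal gives the "times lower" figure.
print(round(1 / fold_change["volatile.acidity"], 2))
```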

Multivariate Plots Section

Multivariate analysis is the next step. In the previous bivariate analysis, we found that some chemical properties correlated well with each other or with quality. In this section, we analyze how our feature of interest, quality, varies with the other chemical properties.

To simplify and see the relationships more clearly, I averaged the chemical properties for each quality and added a new rating variable which groups the quality into three groups.

The above table showed the average value of each chemical property for every wine quality.

Let’s see how the variables vary with quality and with each other.

It was not easy to see which chemical properties varied with each other and with quality. This could be due to the huge differences in the scale of the variables. It is better to investigate two of them per plot. Also, it could be useful to group quality into fewer ordered categories.

Group the quality in three groups using new variable rating

First, let’s add a new variable ‘rating’ which groups wine quality into 3 categories: ‘bad’ for quality <= 4, ‘good’ for quality 5 or 6, and ‘very good’ for quality >= 7.
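
One way to build such a rating is base R `cut()`, sketched below on the six quality scores that occur in the data. This is an illustration, not necessarily the code used in the analysis; it produces an ordered factor, whereas the summary below shows rating stored as a character column.

```r
quality <- 3:8  # the quality scores that occur in the data

# Intervals are closed on the right by default: (-Inf, 4], (4, 6], (6, Inf).
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "good", "very good"),
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

Keeping the three levels ordered from bad to very good means plots and summaries list them in a sensible order rather than alphabetically.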

Let’s summarize the wine by rating.

## Source: local data table [3 x 2]
## 
##      rating n_obs
##       (chr) (int)
## 1      good  1319
## 2 very good   217
## 3       bad    63

So, there were 217 very good wines, 1319 good wines and 63 bad wines.

Average of all variables grouped by rating

## Source: local data frame [3 x 13]
## 
##      rating fixed.acidity volatile.acidity citric.acid residual.sugar
##       (chr)         (dbl)            (dbl)       (dbl)          (dbl)
## 1       bad      7.871429        0.7242063   0.1736508       2.684921
## 2      good      8.254284        0.5385595   0.2582638       2.503867
## 3 very good      8.847005        0.4055300   0.3764977       2.708756
## Variables not shown: chlorides (dbl), free.sulfur.dioxide (dbl),
##   total.sulfur.dioxide (dbl), density (dbl), pH (dbl), sulphates (dbl),
##   alcohol (dbl), quality (dbl)

The table shows the average value of each chemical property for each wine rating.

We can use this data later to plot the relationship of chemical properties vs. rating.

Citric.acid and fixed.acidity correlation, coded by quality

Now, let’s see how citric.acid and fixed.acidity vary with each other and with wine quality.

The scatter plot showed how fixed.acidity and citric.acid varied with wine quality. I added the mean values of fixed.acidity and citric.acid for each quality as red shapes. The blue line was the regression line. We could see a trend of fixed.acidity and citric.acid increasing together. Let’s see if it also holds across wine quality by zooming in on the regression line of the mean data.

Regression line of mean data.

We could clearly see the trend that the higher the wine quality, the higher both fixed.acidity and citric.acid were. Average fixed.acidity rose from 7.78 to 8.57 and average citric.acid from 0.17 to 0.39 as wine quality increased from 4 to 8. This is consistent with fixed.acidity and citric.acid being strongly correlated with each other (correlation coefficient of 0.672) and both being positively correlated with quality (0.124 and 0.226 respectively).

Let’s group the data by rating.

The plot showed a straight green line from rating ‘bad’ to ‘good’ as both fixed.acidity and citric.acid increased. So, we could conclude that fixed.acidity, citric.acid and quality were positively correlated with each other.

Total.sulfur.dioxide and free.sulfur.dioxide, coded by quality

Statistical summary of total.sulfur.dioxide and free.sulfur.dioxide by quality.

## Source: local data frame [6 x 3]
## 
##   free.sulfur.dioxide total.sulfur.dioxide quality
##                 (dbl)                (dbl)   (int)
## 1            11.00000             24.90000       3
## 2            12.26415             36.24528       4
## 3            16.98385             56.51395       5
## 4            15.71160             40.86991       6
## 5            14.04523             35.02010       7
## 6            13.27778             33.44444       8

Now, let’s see how total.sulfur.dioxide and free.sulfur.dioxide vary with each other and with wine quality.

The scatter plot showed how total.sulfur.dioxide and free.sulfur.dioxide varied by quality. I added the mean values of total.sulfur.dioxide and free.sulfur.dioxide for each quality as red shapes. The blue line was the regression line. We could see a trend of total.sulfur.dioxide and free.sulfur.dioxide increasing together. Let’s see if it also holds across wine quality by zooming in on the regression line of the mean data.

Regression line of mean data.

It was interesting to note that wine quality was best in the middle range of both chemical properties (about 14 and 35 mg/dm^3 for free and total sulfur dioxide respectively). The average total.sulfur.dioxide was similar in bad and very good wines, while free.sulfur.dioxide was around 2 mg/dm^3 higher in very good wines. When the concentrations of both chemicals increased further, wine quality dropped. This suggests that low concentrations of these chemicals go with bad-tasting wine, but too much of them (above 35 mg/dm^3 for total.sulfur.dioxide, 14 mg/dm^3 for free.sulfur.dioxide) also goes with reduced wine quality.

Let’s group the data by rating and see the relationship.

We could see that the ‘very good’ rating fell in the middle of the data (blue square shape). It was clear that free.sulfur.dioxide and total.sulfur.dioxide were not well correlated with wine rating.

pH and density, coded by quality:

Statistical summary of pH, density and quality.

## Source: local data frame [6 x 3]
## 
##         pH   density quality
##      (dbl)     (dbl)   (int)
## 1 3.398000 0.9974640       3
## 2 3.381509 0.9965425       4
## 3 3.304949 0.9971036       5
## 4 3.318072 0.9966151       6
## 5 3.290754 0.9961043       7
## 6 3.267222 0.9952122       8

Now, let’s see how pH and density vary with each other and with wine quality.

The scatter plot showed how pH and density varied by quality. I added the mean values of pH and density for each quality as red shapes. The blue line was the regression line. We could see a weak inverse trend between pH and density. Let’s see if it also holds across wine quality by zooming in on the regression line of the mean data.

Regression line of mean data.

pH and density were weakly correlated with each other, and neither was strongly correlated with quality. Lower average pH and density went with higher quality. Higher pH seemed to reduce quality, while the effect of density was not clear.

We could see that the average density changed only from 0.997 to 0.995 g/cm^3. This is a very small range, though the plot showed wine quality increasing as density decreased. Average pH changed from 3.398 to 3.267 while wine quality increased from 3 to 8. So, we could conclude that pH and quality had a weak negative correlation.

Let’s group the quality by rating.

This graph showed a clear negative relationship with pH, with increasing pH going with a lower wine rating, while the relationship was not clear for density.

Alcohol and sulphates, coded by quality:

Statistical summary of alcohol and sulphates by quality

## Source: local data frame [6 x 3]
## 
##   sulphates   alcohol quality
##       (dbl)     (dbl)   (int)
## 1 0.5700000  9.955000       3
## 2 0.5964151 10.265094       4
## 3 0.6209692  9.899706       5
## 4 0.6753292 10.629519       6
## 5 0.7412563 11.465913       7
## 6 0.7677778 12.094444       8

Now, let’s see how alcohol and sulphates varied with each other and with quality.

The scatter plot showed how sulphates and alcohol varied by quality. I added the mean values of sulphates and alcohol for each quality as red shapes. The blue line was the regression line. We could see that sulphates and alcohol did have a positive correlation. Let’s see if it also holds across wine quality by zooming in on the regression line of the mean data.

Regression line of mean data.

In the mean data, sulphates and alcohol increased together. Average sulphates rose from 0.57 to 0.77 and average alcohol from 9.96 to 12.09 as quality increased from 3 to 8.

Let’s group quality by rating.

The plot showed that with a small increase in alcohol and sulphates the wine rating increased from ‘bad’ to ‘good’. So, it could be concluded that sulphates and alcohol had positive correlations with quality.

Volatile.acidity and total.sulfur.dioxide, coded by quality:

Statistical summary of volatile.acidity and total.sulfur.dioxide by quality

## Source: local data frame [6 x 3]
## 
##   volatile.acidity total.sulfur.dioxide quality
##              (dbl)                (dbl)   (int)
## 1        0.8845000             24.90000       3
## 2        0.6939623             36.24528       4
## 3        0.5770411             56.51395       5
## 4        0.4974843             40.86991       6
## 5        0.4039196             35.02010       7
## 6        0.4233333             33.44444       8

Now, let’s see how volatile.acidity and total.sulfur.dioxide varied with quality

The scatter plot showed how total.sulfur.dioxide and volatile.acidity varied by quality. I added the mean values of total.sulfur.dioxide and volatile.acidity for each quality as red shapes. We could see that total.sulfur.dioxide and volatile.acidity were not correlated with each other. Let’s see if this also holds across wine quality by zooming in on the regression line of the mean data.

Regression line of mean data.

Total.sulfur.dioxide was low (around 25 to 36) for qualities 3-4 and 7-8, while volatile.acidity was clearly negatively correlated with quality. Average volatile.acidity decreased from 0.88 to 0.42 while quality increased from 3 to 8.

Let’s group the data by rating.

It was clear that volatile.acidity had a negative relationship with wine rating: an increase in volatile.acidity went with a decrease in wine rating. However, there was no clear relationship between quality and total.sulfur.dioxide, or between volatile.acidity and total.sulfur.dioxide.

Multivariate Summary

pH and density were weakly correlated with each other, and neither was strongly correlated with quality. Lower average pH and density went with higher quality. Higher pH seemed to reduce quality, while the effect of density was not clear.

Total.sulfur.dioxide and volatile.acidity were not correlated with each other. Total.sulfur.dioxide was not correlated with wine quality while volatile.acidity was.

Some chemical properties correlated well with each other but not with quality, such as free.sulfur.dioxide and total.sulfur.dioxide. It was interesting to note that wine quality was best in the middle range of both chemical properties (about 14 and 35 respectively).

Some chemical properties varied together and with wine quality, particularly:

  • Fixed.acidity and citric.acid were strongly correlated with each other. Average fixed.acidity rose from 7.78 to 8.57 and average citric.acid from 0.17 to 0.39 as wine quality increased from 4 to 8.

  • Average sulphates and alcohol increased together. Average sulphates rose from 0.57 to 0.77 and average alcohol from 9.96 to 12.09 as quality increased from 3 to 8.

Final Plots and Summary

We have explored the red wine data with many interesting questions about the data structure, the data summary and how chemical properties vary with each other and with our feature of interest, quality. We have done statistical analysis and made many different kinds of plots such as histograms, box plots, bar graphs, etc. Let’s summarize the findings in three plots.

Plot One: Chemical properties highly influence wine quality

From plot 1, we could see that alcohol, citric.acid, fixed.acidity and sulphates were positively related to wine quality (green bars). Among those properties, sulphates, citric.acid and alcohol had the strongest associations, with correlations of 0.251, 0.226 and 0.476 respectively.

Volatile.acidity, total.sulfur.dioxide, density and chlorides were negatively related to wine quality (red bars). Among those properties, volatile.acidity had the strongest association, with a correlation of -0.391.

Plot Two

After finding that alcohol and volatile.acidity have the strongest impact on wine quality, let’s summarize their relationships with wine quality. The plots below were selected and improved from the bivariate plots section.

Statistical summary of average alcohol and volatile.acidity by quality:

## Source: local data frame [6 x 3]
## 
##   quality   alcohol volatile.acidity
##     (int)     (dbl)            (dbl)
## 1       3  9.955000        0.8845000
## 2       4 10.265094        0.6939623
## 3       5  9.899706        0.5770411
## 4       6 10.629519        0.4974843
## 5       7 11.465913        0.4039196
## 6       8 12.094444        0.4233333

Average volatile.acidity rose from 0.423 to 0.884 as wine quality fell from 8 to 3 (with a minimum of 0.404 at quality 7), while average alcohol rose from 9.955 to 12.094 as wine quality increased from 3 to 8. These results were consistent with the correlation findings, where volatile.acidity had a correlation coefficient of -0.391 and alcohol one of 0.476. I suggest using these two chemical properties as the main features for a quality prediction model.
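
No model was fitted in this analysis. As a sketch of what a first two-feature model might look like, the code below fits an ordinary linear regression with base R `lm()`; the choice of a linear model is an assumption. It runs on only the first ten rows from the `str()` output, so the coefficients are not meaningful and are shown only to illustrate the call.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Linear model of quality on the two suggested features.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine_head)
print(coef(fit))

# Predicted quality for a hypothetical wine; the input values are made up.
new_wine <- data.frame(alcohol = 10, volatile.acidity = 0.6)
print(predict(fit, newdata = new_wine))
```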

Plot Three

Next, let’s see whether any of the chemical properties were strongly correlated with each other and also with quality. The plots below were selected and improved from the multivariate plots section.

Note that I grouped the wine quality into 3 groups: bad (quality of 3 and 4), good (quality of 5 and 6) and very good (quality of 7 and 8).

Statistical summary of average alcohol, sulphates, citric.acid and fixed.acidity by rating.

##      rating  alcohol sulphates citric.acid fixed.acidity
## 1       bad 10.21587 0.5922222   0.1736508      7.871429
## 2      good 10.25272 0.6472631   0.2582638      8.254284
## 3 very good 11.51805 0.7434562   0.3764977      8.847005

We found that:

  • Fixed.acidity and citric.acid were strongly correlated with each other. Average fixed.acidity rose from 7.87 to 8.85 and average citric.acid from 0.17 to 0.38 as wine rating went from bad to very good.

  • Average sulphates and alcohol increased together. Average sulphates rose from 0.59 to 0.74 and average alcohol from 10.22 to 11.52 as wine rating went from bad to very good.

It was interesting to note that all four chemical properties were positively correlated with wine quality, as shown in plot 1. In particular, fixed.acidity, sulphates, citric.acid and alcohol had correlations of 0.124, 0.251, 0.226 and 0.476 respectively.

When we build a model to predict quality, we should carefully select the features so that no two or three of them are too strongly correlated with each other.
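
A simple screen for this is to list the pairs of candidate features whose correlation exceeds some cutoff. The sketch below uses the pairwise coefficients reported earlier in this analysis; the 0.6 cutoff is a hypothetical choice, not a value taken from the analysis.

```r
# Pairwise correlations between chemical properties, as reported in this analysis.
reported <- data.frame(
  var1 = c("citric.acid", "citric.acid", "total.sulfur.dioxide", "pH"),
  var2 = c("fixed.acidity", "volatile.acidity", "free.sulfur.dioxide", "density"),
  r    = c(0.672, -0.552, 0.668, -0.342),
  stringsAsFactors = FALSE
)

threshold <- 0.6  # hypothetical cutoff for "too correlated"

# Pairs where using both variables as features may add little new information.
too_correlated <- reported[abs(reported$r) > threshold, ]
print(too_correlated)
```

With this cutoff, the citric.acid and fixed.acidity pair and the two sulfur dioxide measures are flagged, so a model might keep only one variable from each pair.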

Reflection